The CUDA execution model transforms your computer into a high-performance heterogeneous system. Imagine a Grand Director (the Host/CPU) and an Army of Thousands (the Device/GPU). The Director handles complex logic and decision-making, while the Army performs massive, repetitive tasks simultaneously.
1. The Architectural Divide
The Host is a latency-optimized CPU designed for complex control flow and serial tasks. By contrast, the Device is a throughput-optimized GPU containing thousands of simple cores designed to execute the same instruction across vast datasets simultaneously.
2. The Execution Rhythm
A CUDA program alternates between phases. Execution begins on the Host with serial code. When the program reaches a parallel kernel launch, it dispatches a Grid of threads onto the Device. Conceptually, control returns to the Host once the Device finishes its massive workload; in practice, kernel launches are asynchronous, so the Host continues immediately and must synchronize explicitly before consuming the results.
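This rhythm can be sketched in a minimal program. The kernel name `doubleElements` and the doubling workload are illustrative choices, not part of any standard API; the phase structure (serial host setup, grid launch, synchronization, serial host wrap-up) is the point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: each thread doubles one element of the array.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // unified memory, visible to host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // serial host phase

    int threads = 256;
    int blocks = (n + threads - 1) / threads;      // enough blocks to cover all n elements
    doubleElements<<<blocks, threads>>>(data, n);  // parallel kernel phase: launch a grid
    cudaDeviceSynchronize();                       // wait for the device before the host reads results

    printf("data[0] = %f\n", data[0]);             // serial host phase resumes
    cudaFree(data);
    return 0;
}
```

Compiled with `nvcc`, this follows the serial-parallel-serial pattern described above: the host owns everything before the `<<<blocks, threads>>>` launch and everything after the synchronize call.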
3. Performance Specialization
The model leverages the strengths of both: the CPU manages system resources and complex, branch-heavy control flow, while the GPU executes SPMD (Single-Program, Multiple-Data) logic, applying one program across many data elements in parallel.
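SPMD is easiest to see in a classic example kernel such as SAXPY (y = a*x + y): every thread runs the identical program, but its computed global index steers it to a different data element. This is an illustrative sketch, not a library routine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD in practice: one program, many data elements.
// Each thread's global index selects the element it owns.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
}

int main() {
    const int n = 4096;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));

    // Host (the Director) handles setup and decides what to run.
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Device (the Army) applies the same instruction stream to every element.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2.0f * 1.0f + 2.0f = 4.0f
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The division of labor mirrors the section above: the branchy decisions (problem size, launch configuration, what to do with the results) stay on the CPU, while the uniform per-element arithmetic runs on the GPU.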